ABSTRACT

In this paper, we consider multi-sensor classification when there is a large number of unlabeled samples. The problem is formulated under the multi-view learning framework and a Consensus-based Multi-View Maximum Entropy Discrimination (CMV-MED) algorithm is proposed. By iteratively maximizing the stochastic agreement between multiple classifiers on the unlabeled dataset, the algorithm simultaneously learns multiple high-accuracy classifiers. We demonstrate that our proposed method can yield improved performance over previous multi-view learning approaches by comparing performance on three real multi-sensor data sets.

Index Terms — sensor networks, multi-view learning, 
maximum entropy discrimination, kernel machine 

1. INTRODUCTION 

In many applications, e.g., in sensor networks, data is collected from multiple sensors and, given that complementary information is present within different sensors, classification using all sensors is expected to yield higher performance than its single-sensor counterpart [1]. Furthermore, as class labeling can be labor intensive, in many situations many training samples may not be labeled. In the machine learning literature, this problem falls under the framework of semi-supervised multi-view learning [2], since the partially labeled samples are multi-modal in nature and each modality corresponds to one view of the physical event.

Most approaches to multi-sensor or multi-view classification rely either on feature fusion (early fusion) methods, which find an intermediate joint representation of multiple views [3], or on decision fusion (late fusion) methods, which combine decisions from multiple models to improve the overall performance [4]. Unless the features are optimized for multi-view aggregation, there is no guarantee that feature fusion will lead to good classification performance. In this paper, we pursue a different approach that learns an intermediate model, or a consensus view, to fuse features from different views, and simultaneously improves the performance of each single-view classifier. Moreover, we propose to train a set of stochastic classifiers to handle the large number of unlabeled training samples.

We follow the principle of disagreement-based multi-view learning [5] [6] [7] [8] [9] [10] [11]. In particular, it is shown in [?] that the error rate of each classifier in the multi-view system is bounded above by the rate of disagreement between multiple view-specific classifiers. In other words, an algorithm that explicitly minimizes the disagreement between multiple view-specific classifiers would learn a set of compatible classifiers with high performance and low sample complexity. In this paper, we propose a Consensus-based Multi-View Maximum Entropy Discrimination (CMV-MED) algorithm that learns a set of classifiers, one for each view, by iteratively maximizing their stochastic agreement on the unlabeled training data. Our method is based on Maximum Entropy Discrimination (MED) by Jaakkola et al. [?]. MED is a Bayesian learning approach that generalizes support vector machine (SVM) classifiers and explicitly incorporates large-margin training [?] into a unified maximum entropy learning framework. We show the superior performance of our model over previous multi-view learning approaches by comparing performance on three real multi-sensor data sets.

This paper is structured as follows: an overview of the MED model is given in Section 2, and we propose the general model for CMV-MED in Section 3. The algorithm for solving CMV-MED is discussed in Section 4. In Section 5, experiments on a set of real multi-view data sets are discussed.

2. MAXIMUM ENTROPY DISCRIMINATION (MED) 

We denote the multi-view data set as $\mathcal{D}_V$. $\mathcal{D}_V$ consists of the labeled part $\{(x_n, y_n), n \in L\}$ and the unlabeled part $\{x_m, m \in U\}$, where $L$ and $U$ represent the index sets of labeled and unlabeled samples, respectively, and $|L| \ll |U|$. Define the multi-view feature $x_n = [x_n^1, \ldots, x_n^V]$, $\forall n \in L \cup U$, where $x_n^i \in \mathbb{R}^{d_i}$ are the features extracted from view $i$ and $V$ is the number of views. Here we consider the binary classification task, i.e., $y \in \mathcal{Y} = \{-1, +1\}$. Let $\mathcal{D}^i$ be the set of samples collected from the single view $i$. In this section, we focus on the single-view MED on the labeled subset $L$.

For a single view $i \in \{1, \ldots, V\}$, assume the predictive distribution is a generalized log-linear model, i.e., $\log p_i(y | x^i, w_i) \propto \frac{1}{2} y \left( w_i^T \Phi_i(x^i) \right) = F_i(y, x^i; w_i)$, where $\Phi_i$ is a prescribed feature map defined on $\mathbb{R}^{d_i}$ in view $i$. Define the kernel function $K_i : \mathbb{R}^{d_i} \times \mathbb{R}^{d_i} \to \mathbb{R}$ that satisfies $\langle \Phi_i(x_n^i), \Phi_i(x_m^i) \rangle = K_i(x_n^i, x_m^i)$ for all $x_n^i, x_m^i \in \mathcal{D}^i$ in view $i$. $F_i(y, x^i; w_i)$ is the normalized log-likelihood function parameterized by $w_i$ in the kernel space.

Denote the prior distribution of $w_i$ as $p_0(w_i)$. The goal of Maximum Entropy Discrimination [?] is to learn a post-data (posterior) distribution $q(w_i)$ by solving an entropy-regularized risk minimization problem with the prior on the model parameter specified as $p_0(w_i)$:

$$\min_{q(w_i)} \; \mathrm{KL}\left(q(w_i) \,\|\, p_0(w_i)\right) + \sum_{n \in L} \left[1 - \mathbb{E}_{q(w_i)}\{\Delta F_i(y_n, x_n^i; w_i)\}\right]_+ , \tag{1}$$

where $[s]_+ = \max\{s, 0\}$. $\mathrm{KL}(p \| q)$ is the Kullback-Leibler divergence from distribution $p$ to $q$, i.e., $\mathrm{KL}(q(w_i) \| p_0(w_i)) = \int q(w_i) \log \frac{q(w_i)}{p_0(w_i)} \, dw_i$, and $\Delta F_i(y_n, x_n; w_i) = F_i(y_n, x_n^i; w_i) - F_i(y \neq y_n, x_n^i; w_i) = \log \frac{p_i(y_n | x_n^i, w_i)}{p_i(y \neq y_n | x_n^i, w_i)}$ is the log-odds classifier.
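As a short check of the log-odds identity, the following derivation assumes the log-linear model carries the factor $\frac{1}{2}$, i.e., $\log p_i(y | x^i, w_i) = \frac{1}{2} y f - \log Z$ with the shorthand $f = w_i^T \Phi_i(x^i)$ (a symbol introduced here for convenience), and that $Z$ normalizes over $y \in \{-1, +1\}$:

```latex
p_i(y \mid x^i, w_i)
  = \frac{e^{y f / 2}}{e^{f/2} + e^{-f/2}}
  = \frac{1}{1 + e^{-y f}},
\qquad
\Delta F_i(y_n, x_n; w_i)
  = \log \frac{p_i(y_n \mid x_n^i, w_i)}{p_i(-y_n \mid x_n^i, w_i)}
  = y_n f_n ,
\quad f_n = w_i^T \Phi_i(x_n^i).
```

Under these assumptions the predictive model is a logistic function of the margin $y f$, and the hinge term in (1) penalizes labeled samples whose expected log-odds in favor of the true label fall below 1.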

The second term in (1) is a hinge loss that captures the large-margin principle underlying the MED prediction rule,

$$y^* = \arg\max_{y} \; \mathbb{E}_{q(w_i)}\left[F_i(y, x^i; w_i)\right].$$

If we use a Gaussian Process [?] as the prior on $w_i$, i.e., $p_0(w_i) = \mathcal{N}(w_i; 0, \sigma^2 I)$, a kernel SVM is obtained by solving (1) in its dual formulation. For multi-view data, it is necessary to learn multiple MEDs simultaneously. For example, in [?], the author applies a joint sparsity prior on $(w_1, \ldots, w_V)$ to achieve multi-task feature selection. Instead of assuming a joint prior on all multi-view model parameters, we utilize the available unlabeled samples and require the class predictions of the multiple models to agree with each other.
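The link between objective (1) and an SVM can be checked numerically. The sketch below is not the authors' implementation. It assumes a linear feature map $\Phi(x) = x$, a posterior restricted to $q(w) = \mathcal{N}(\mu, \sigma^2 I)$ with the same covariance as the prior, and a margin of the form $\Delta F = y\, w^T \Phi(x)$. Under those assumptions the KL term reduces to $\|\mu\|^2 / (2\sigma^2)$, so (1) becomes a regularized hinge loss in $\mu$. The function name `med_objective` and the toy data are illustrative only.

```python
def med_objective(mu, xs, ys, sigma2=1.0):
    """Objective (1) for q(w) = N(mu, sigma2*I) and prior N(0, sigma2*I).

    Assumes a linear feature map, so E_q[Delta F] = y * <mu, x>.
    """
    # KL between two Gaussians that share the covariance sigma2*I.
    kl = sum(m * m for m in mu) / (2.0 * sigma2)
    # Hinge loss on the expected margin of each labeled sample.
    hinge = 0.0
    for x, y in zip(xs, ys):
        margin = y * sum(m * xi for m, xi in zip(mu, x))
        hinge += max(1.0 - margin, 0.0)
    return kl + hinge


# Toy labeled set: two points on either side of the origin.
xs = [(2.0, 0.0), (-2.0, 0.0)]
ys = [1, -1]

obj_zero = med_objective((0.0, 0.0), xs, ys)  # zero margin on both samples
obj_half = med_objective((0.5, 0.0), xs, ys)  # margin exactly 1 on both samples
```

With `mu = (0.5, 0.0)` both samples sit exactly on the margin, so only the KL penalty remains; this is the trade-off between margin violations and distance from the prior that the dual formulation solves.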

3. CONSENSUS-BASED MULTI-VIEW MED: A GENERAL FRAMEWORK

Define the consensus view model as a parameter-free distribution $q(y | x_n) \in Q$ on the unlabeled set $U$, where $x_n = [x_n^1, \ldots, x_n^V]$, $\forall n \in U$, $Q = \{q(x) : q(x) \geq 0, \int q(x)\,dx = 1\}$, and $q(y | x_n) = \delta\{y = y_n\}$, $n \in L$. In each view $i$, a joint post-data distribution is obtained as $q_i(y, w_i | x) = q(y | x)\, q(w_i)$, where $q(y | x)$ is shared among all views and the above equality reflects the mean-field approximation.

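The goal of Consensus-based Multi-view Maximum Entropy Discrimination (CMV-MED) is to simultaneously learn the joint post-data distributions $q_i(y, w_i | x) = q(y | x)\, q(w_i)$, given the priors $p_i(y, w_i | x^i) = p_i(y | w_i, x^i)\, p_0(w_i)$ for $x^i \in \mathcal{D}^i$, $i = 1, \ldots, V$. This is accomplished by solving the following optimization problem

$$\min_{q_i(y, w_i | x_n) \in Q,\; i = 1, \ldots, V} \; \sum_{n \in L} \sum_{i=1}^{V} \left[1 - \mathbb{E}_{q_i(y, w_i | x_n)}\{\Delta F_i(y, x_n^i; w_i)\}\right]_+ + \lambda \sum_{n \in U} \sum_{i=1}^{V} \pi_i\, \mathrm{KL}\left(q_i(y, w_i | x_n) \,\|\, p_i(y, w_i | x_n^i)\right), \tag{2}$$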

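where $\pi_i \in [0, 1]$, $\sum_{i=1}^{V} \pi_i = 1$; $\pi_i$ is the parameter for view $i$ and $\lambda > 0$ is the regularization parameter. Note that $q_i(y, w_i | x_n) = \delta\{y = y_n\}\, q(w_i)$ on the labeled set $L$, and the second term can be further expanded as

$$\mathrm{KL}\left(q_i(y, w_i | x_n) \,\|\, p_i(y, w_i | x_n^i)\right) = \mathrm{KL}\left(q(w_i) \,\|\, p_0(w_i)\right) + \mathbb{E}_{q(w_i)}\left[\mathrm{KL}\left(q(y | x_n) \,\|\, p_i(y | x_n^i, w_i)\right)\right], \quad i = 1, \ldots, V. \tag{3}$$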
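The expansion in (3) is the chain rule for the KL divergence applied to a factorized posterior. A short derivation, using only the mean-field form $q(y | x_n)\, q(w_i)$ of the posterior and the prior factorization $p_i(y | w_i, x^i)\, p_0(w_i)$ stated in the text:

```latex
\begin{aligned}
\mathrm{KL}\big(q(y|x_n)\,q(w_i)\,\big\|\,p_i(y|w_i,x_n^i)\,p_0(w_i)\big)
&= \sum_{y}\int q(y|x_n)\,q(w_i)
   \left[\log\frac{q(w_i)}{p_0(w_i)}
       + \log\frac{q(y|x_n)}{p_i(y|x_n^i,w_i)}\right] dw_i \\
&= \mathrm{KL}\big(q(w_i)\,\big\|\,p_0(w_i)\big)
 + \mathbb{E}_{q(w_i)}\Big[\mathrm{KL}\big(q(y|x_n)\,\big\|\,p_i(y|x_n^i,w_i)\big)\Big].
\end{aligned}
```

The first term follows because $\sum_y q(y | x_n) = 1$, and the second because the inner sum over $y$ is itself a KL divergence for each fixed $w_i$.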

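Substituting (3) into (2), we have the following

$$\min_{q(y | x_n),\; q(w_i),\, \forall i = 1, \ldots, V} \; \sum_{n \in L} \sum_{i=1}^{V} \left[1 - \mathbb{E}_{q(w_i)}\{\Delta F_i(y_n, x_n^i; w_i)\}\right]_+ + \lambda \sum_{i=1}^{V} \pi_i\, \mathrm{KL}\left(q(w_i) \,\|\, p_0(w_i)\right) + \lambda \sum_{n \in U} \sum_{i=1}^{V} \pi_i\, \mathbb{E}_{q(w_i)}\left[\mathrm{KL}\left(q(y | x_n) \,\|\, p_i(y | x_n^i, w_i)\right)\right]. \tag{4}$$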

From (4), we see that the first and second terms learn $V$ view-specific MED models $q(w_i)$, $i = 1, \ldots, V$, simultaneously.

Our main contribution is the third term in (4), which is referred to as the consensus-based disagreement term on the unlabeled set, since it is zero when the view-specific predictive models $p_i(y | x^i, w_i)$, $i = 1, \ldots, V$, are all equal, while it penalizes more heavily when one deviates far from the consensus model $q(y | x)$, which, by construction, is the center of these $V$ distributions in the information geometry over the space of probability measures. This center is determined by the information projection accomplished by the KL divergence in (4). By incorporating this term, we explicitly require all classifiers to make similar class predictions with similar confidence levels on the unlabeled training samples. The benefit of enforcing the consensus-based disagreement is that the proposed model is sensitive in the case when view-specific classifiers with low confidence agree with each other, while it is lenient when all of them are highly confident and agree. Thus the model is reliable in the situation where the initial view-specific classifiers only have low-confidence results due to the limited size of the labeled training set. Fig. 1 is a graphical model representation of the information projection.
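To make the disagreement term concrete, the sketch below evaluates it for binary labels. It is an illustration under stated assumptions, not the authors' code: each view is summarized by a single number, its predictive probability of $y = +1$ (the expectation over $q(w_i)$ is replaced by a point estimate), and the consensus is taken to be the normalized weighted geometric mean of the view distributions, which is the minimizer of $\sum_i \pi_i\, \mathrm{KL}(q \| p_i)$ over $q$. The names `consensus`, `kl_bernoulli`, and `disagreement` are hypothetical.

```python
import math


def consensus(view_probs, weights):
    """Normalized weighted geometric mean of the views' P(y=+1).

    view_probs[i] is view i's predictive probability of the class y=+1.
    Returns the consensus probability q(y=+1).
    """
    log_pos = sum(w * math.log(p) for w, p in zip(weights, view_probs))
    log_neg = sum(w * math.log(1.0 - p) for w, p in zip(weights, view_probs))
    # Normalize over the two classes (this plays the role of 1/Z).
    return math.exp(log_pos) / (math.exp(log_pos) + math.exp(log_neg))


def kl_bernoulli(q, p):
    """KL(q || p) between two Bernoulli distributions given P(y=+1)."""
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))


def disagreement(view_probs, weights):
    """Weighted sum of KL(q || p_i) evaluated at the consensus q."""
    q = consensus(view_probs, weights)
    return sum(w * kl_bernoulli(q, p) for w, p in zip(weights, view_probs))


weights = [0.5, 0.5]
same = disagreement([0.8, 0.8], weights)  # identical views
diff = disagreement([0.9, 0.3], weights)  # conflicting views
```

When the two views produce the same predictive distribution the term vanishes (up to floating-point rounding), and it is strictly positive when they conflict, which is the behavior the text attributes to the third term of (4).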

4. SOLUTION VIA DETERMINISTIC ANNEALING EXPECTATION MAXIMIZATION

Our solution for CMV-MED in (4) is based on deterministic annealing EM [?]. It is described in the following steps:





[Figure 1: graphical model with nodes $p_1$, $q$, $p_2$.]

Fig. 1: A graphical model representation for consensus-based multi-view learning via information projection.

1. Set the regularization parameter $\lambda_0 = 0$ in (4) at initialization and train $V$ independent MED classifiers simultaneously to find $q_0(w_i)$, $i = 1, \ldots, V$. Set the prior distribution $p_0(w_i) = \mathcal{N}(w_i; 0, \sigma^2 I)$ and $\pi_i = 1/V$, $\forall i$. Let $T$ be the maximum number of iterations.


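2. For $t = 1, \ldots, T$, do

(a) Given the post-data distribution $q_{t-1}(w_i)$, $i = 1, \ldots, V$, from MED, find the consensus view on the unlabeled data $U$ via information projection, i.e.,

$$q_t(y | x_n) = \arg\min_{q_n \in Q} \; \frac{1}{V} \sum_{i=1}^{V} \mathbb{E}_{q_{t-1}(w_i)}\left[\mathrm{KL}\left(q_n(y) \,\|\, p_{i,n}(y | w_i)\right)\right]$$

$$\Rightarrow \; \log q_t(y | x_n) = \frac{1}{V} \sum_{i=1}^{V} \log p_{i,n}(y | \bar{w}_{i,t-1}) - \log Z(x_n), \quad \forall n \in U,$$

where $q_n(y) = q(y | x_n)$, $p_{i,n}(y | w_i) = p_i(y | x_n^i, w_i)$ for $n \in U$, $Z(x_n)$ is the normalization factor and $\bar{w}_{i,t-1}$ is the mean of the post-data distribution $q_{t-1}(w_i)$.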

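(b) Given the consensus view $q_t(y | x_n)$, $\forall n \in U$, substitute it into (4) to obtain the following optimization problem

$$\min_{q(w_i),\, i = 1, \ldots, V} \; \sum_{n \in L} \sum_{i=1}^{V} \left[1 - \mathbb{E}_{q(w_i)}\{\Delta F_i(y_n, x_n^i; w_i)\}\right]_+ + \lambda_t \sum_{n \in U} \sum_{i=1}^{V} \pi_i\, \mathbb{E}_{q(w_i)}\left[\mathbb{E}_{q_t(y | x_n)}\left[-\log p_i(y | x_n^i, w_i)\right]\right] + \sum_{i=1}^{V} \pi_i\, \mathrm{KL}\left(q(w_i) \,\|\, p_0(w_i)\right)$$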


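For each view $i$, compute $q_t(w_i | \mathcal{D}^i, \alpha^i)$ with dual parameter $\alpha^i = [\alpha_1^i, \ldots, \alpha_{|L|}^i]$ by solving the following dual programming problem, i.e.,

$$\max_{\alpha^i} \; \mathbf{1}^T \alpha^i - \frac{1}{2} (\alpha^i)^T \left(\tilde{K}_i \odot y y^T\right) \alpha^i \tag{5}$$

$$\text{s.t.} \quad 0 \leq \alpha^i \leq 1,$$

where $\mathbf{1} = [1, \ldots, 1]^T$ and $\odot$ is the element-wise product. In (5), a new kernel is computed via

$$\tilde{K}_i = K_{L,i} - \lambda_t K_{LU,i} \left[M_i^{-1} + \lambda_t K_{U,i}\right]^{-1} K_{UL,i} \tag{6}$$

$$= \left[\langle \tilde{\Phi}_i(x_n^i), \tilde{\Phi}_i(x_m^i) \rangle\right]_{n, m \in L}, \tag{7}$$

where $K_{L,i} = [K_i(x_n^i, x_m^i)]_{n, m \in L}$, $K_{U,i} = [K_i(x_n^i, x_m^i)]_{n, m \in U}$, $K_{UL,i} = [K_i(x_n^i, x_m^i)]_{n \in U, m \in L} = K_{LU,i}^T$ and $M_i = \mathrm{diag}\{v_1, \ldots, v_{|U|}\}$, with each $v_n$, $n \in U$, computed from $\mathbb{E}_{q_t(y | x_n)}\left[\log p_i(y | x_n^i, \bar{w}_{i,t-1})\right]$.

Then the post-data distribution is $q_t(w_i | \mathcal{D}^i, \alpha^i) = \mathcal{N}(\bar{w}_{i,t}, \Sigma_i)$, where the mean $\bar{w}_{i,t}$ is given by a sum over samples $m$ determined by $\alpha^i$, and the covariance matrix is $\Sigma_i = \left(\sigma^{-2} I + \Phi_i(X_U)^T M_i \Phi_i(X_U)\right)^{-1}$ with $\Phi_i(X_U) = [\Phi_i(x_1^i), \ldots, \Phi_i(x_{|U|}^i)]^T$.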

(c) Set $\lambda_t = 1 - \ldots \to 1$ as $t$ increases.

(d) $t \leftarrow t + 1$.

3. Finally, make the prediction based on the consensus view

$$y^* = \arg\max_{\hat{y}} \sum_{1 \leq i \leq V} \mathbb{E}\left[\delta\{y = \hat{y}\}\, F_i(y, x^i; w_i)\right].$$

Note that Step 2(b) can be performed in parallel across views, as the subproblem for each view does not rely on information from the other views.
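The box-constrained dual in (5) can be solved with any quadratic programming routine. As a minimal sketch, and not the solver used by the authors, the code below applies projected gradient ascent to a problem of the same form. It takes an arbitrary symmetric positive semidefinite kernel matrix `K` as input (in the algorithm this would be the modified kernel of (6), which is not constructed here), and the helper names `solve_box_dual` and `decision_value` are hypothetical.

```python
def solve_box_dual(K, y, n_iter=2000):
    """Projected gradient ascent for
        max_a  sum(a) - 0.5 * a^T (K * y y^T) a   s.t. 0 <= a <= 1.
    K is a symmetric PSD kernel matrix (list of lists), y holds +/-1 labels.
    """
    n = len(y)
    Q = [[K[i][j] * y[i] * y[j] for j in range(n)] for i in range(n)]
    # trace(Q) upper-bounds the largest eigenvalue of a PSD matrix,
    # so 1/trace(Q) is a safe (conservative) step size.
    step = 1.0 / sum(Q[i][i] for i in range(n))
    a = [0.0] * n
    for _ in range(n_iter):
        grad = [1.0 - sum(Q[i][j] * a[j] for j in range(n)) for i in range(n)]
        # Gradient step followed by projection onto the box [0, 1]^n.
        a = [min(1.0, max(0.0, a[i] + step * grad[i])) for i in range(n)]
    return a


def decision_value(K_row, y, a):
    """Kernel expansion sum_m a_m y_m K(x, x_m) for one point x."""
    return sum(am * ym * k for am, ym, k in zip(a, y, K_row))


# Toy 1-D problem with a linear kernel K(x, x') = x * x'.
xs = [2.0, 1.0, -1.0, -2.0]
ys = [1, 1, -1, -1]
K = [[xi * xj for xj in xs] for xi in xs]
alpha = solve_box_dual(K, ys)
scores = [decision_value(K[i], ys, alpha) for i in range(len(xs))]
```

On this toy problem only the two points closest to the boundary receive nonzero dual weight, and every training point ends up with margin at least 1. Because each view has its own kernel and its own dual variables, one such solve per view can run independently, which is what makes Step 2(b) parallelizable.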

5. EXPERIMENTS 

We compare the proposed CMV-MED model with the SVM-2K model proposed by Farquhar et al. [7], the MV-MED model by Sun et al. [11], as well as the conventional MED for each view on several real multi-view data sets. In the following experiments, we focus on two-view learning, i.e., $V = 2$, and use the Gaussian kernel function $K_i(x_n^i, x_m^i) = \exp(-c \|x_n^i - x_m^i\|^2)$, $i = 1, 2$. For all MED-based methods, a Gaussian Process prior $p_0(w_i) = \mathcal{N}(0, \sigma_i^2 I)$ is assigned for view $i = 1, 2$. The view parameters are $\pi_1 = \pi_2 = \frac{1}{2}$. All other parameters for each model are obtained by 5-fold cross-validation. All the experiments are repeated 20 times, with randomly chosen $L$ and $U$.
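For reference, a Gaussian kernel matrix for one view can be computed as in the sketch below. The bandwidth value `c=0.5` and the feature tuples are placeholders; in the experiments the kernel parameter is among those selected by cross-validation. The function name `gaussian_kernel_matrix` is illustrative.

```python
import math


def gaussian_kernel_matrix(X, c):
    """K[n][m] = exp(-c * ||x_n - x_m||^2) for rows x_n of X, with c > 0."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [[math.exp(-c * sq_dist(u, v)) for v in X] for u in X]


# Hypothetical features of three samples in one view.
X_view = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gaussian_kernel_matrix(X_view, c=0.5)
```

The resulting matrix is symmetric with ones on the diagonal, and entries decay as the samples move apart, so a larger `c` makes the kernel more local.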

5.1. Footstep Classification 

We test on the ARL-Footstep [18] [19] data, which is a multi-sensor data set that contains acoustic signals collected by four well-synchronized sensors (labeled as Sensor 1, 2, 3, 4) in a natural environment. The task is to discriminate between human footsteps and human-leading animal footsteps. We only use Sensors 1 and 2 in our experiment. The data involves 840 segments from human subjects and 660 segments from human-animal subjects. We choose 600 segments from each class as the training set with $|L| = 50$, and the rest is designated as the test set. A 200-dimensional mel-frequency cepstral coefficient (MFCC) vector is computed from the corresponding segments in all the views, with normalization as in [19].

In Table 1, we see that our CMV-MED outperforms both SVM-2K and MV-MED, and it improves over the single-view MED. This is likely because our method utilizes the confidence as well as the decision as a disagreement measure.





Classification accuracy (%), mean ± standard error

| Data set | Size of L | MED (single views) | SVM-2K | MV-MED | CMV-MED |
|---|---|---|---|---|---|
| ARL-Footstep (Sensor 1, 2) | 50 | 71.1 ± 5.3 / 62.3 ± 10.2 | 73.3 ± 5.2 | 75.6 ± 6.5 | **85.5 ± 6.1** |
| WebKB4 | 15 | 76.6 ± 10.2 / 77.1 ± 10.1 | 79.0 ± 10.0 | 77.9 ± 8.7 | **91.7 ± 5.8** |
| Internet Ads | 50 | 87.3 ± 0.9 / 86.2 ± 1.4 | 82.5 ± 4.3 | 88.8 ± 2.3 | **92.7 ± 0.7** |

Table 1: Classification accuracy on the different data sets, with the best performance shown in bold.

Fig. 2: The classification accuracy vs. the size of the labeled set for (a) the ARL-Footstep data set, (b) the WebKB4 data set and (c) the Internet Ads data set. The proposed CMV-MED outperforms MV-MED, SVM-2K and the two single-view MEDs (view 1 and 2), and it has good stability when the number of labeled samples is small.


In the ARL-Footstep data, since the signal is contaminated by background noise, the original MED on the two single views does not perform well, and both the decision regularization and the margin regularization are not as reliable as the confidence regularization implemented by CMV-MED.

Fig. 2(a) shows the accuracy and the standard deviation for the four methods as the size of the labeled set increases. As more ground truth labels are used, the performance of all training methods increases, while CMV-MED consistently shows superior performance.

5.2. Web-Page Classification 

The WebKB4 [20] data set is widely used in the multi-view learning literature [?]. It consists of 1051 two-view web pages collected from computer science department web sites at four universities. There are 230 course pages and 821 non-course pages. The two natural views are the words in a web page and the words appearing in the links pointing to that page. We follow the preprocessing step in [10] and extract a 3000-dimensional feature vector via the bag-of-words representation in the page view and a 1840-dimensional feature vector in the link view. Then we compute the term frequency-inverse document frequency (TF-IDF) features from the document-word matrix. The feature vector is length normalized.
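The preprocessing chain described above (bag of words, TF-IDF weighting, length normalization) can be sketched as follows. This is an illustration only: it assumes the common weighting $\mathrm{tf} \cdot \log(N / \mathrm{df})$ and L2 normalization, which may differ from the exact variant used in the cited preprocessing, and the tokenized pages are made up rather than taken from WebKB4.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors (as dicts) for tokenized docs.

    Uses idf(t) = log(N / df(t)); other IDF variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        # Length normalization; guard against an all-zero vector.
        vectors.append({t: v / norm for t, v in vec.items()} if norm > 0 else vec)
    return vectors


# Hypothetical tokenized pages (not taken from WebKB4).
docs = [["course", "homework", "exam"],
        ["course", "faculty", "research"],
        ["research", "lab", "faculty"]]
vecs = tfidf_vectors(docs)
```

Terms that occur in fewer documents receive larger weights, and after normalization every page vector has unit length, so pages of different sizes become comparable under the kernel.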

In Table 1, we see that our CMV-MED has significantly better performance than SVM-2K and MV-MED when the labeled set is small, i.e., $|L| = 15$. Also, according to Fig. 2(b), when more labeled samples are included, all four methods have similarly good performance, even the single-view MED. CMV-MED performs better with few labeled samples because its stability relies on a good estimate of confidence on the unlabeled training samples, which is less affected by the number of labeled training samples.


5.3. Internet Advertisement Classification 

The Internet Ads [?] data set consists of 3279 instances including 458 ads images and 2820 non-ads images. The first view describes the image itself, i.e., words in the image's URL and caption, while the other view contains all other features, i.e., words from the URLs of the pages that contain the image and the pages which the image points to. For each view, we extract the bag-of-words representations, which results in a 587-dimensional vector in view 1 and a 967-dimensional vector in view 2. We set the size of the training set to 600 and $|L| = 50$.

From Table 1 and Fig. 2(c), we see that our CMV-MED still performs better than SVM-2K, MV-MED and single-view MED. CMV-MED is also more stable as the size of the labeled training set increases, while SVM-2K is much less stable.

6. CONCLUSION 

In this paper, we propose a consensus-based multi-view maximum entropy learning model that incorporates large-margin classification and Bayesian learning when a large number of unlabeled samples from multiple sources are available. The experimental results on three different real data sets show the superiority of the proposed CMV-MED over other multi-view large-margin classification methods in terms of classification accuracy, especially when the number of labeled samples is small compared to the number of unlabeled ones.